Abstract

In this paper, a first attempt at deriving an improved performance
measure for language models, the probability ratio measure (PRM), is
described. In a proof-of-concept experiment, it is shown that PRM
correlates better with recognition accuracy than perplexity does and
can lead to better recognition results when used as the optimisation
criterion of a clustering algorithm. In spite of the approximations
and limitations of this preliminary work, the results are very
encouraging and should justify more work along the same lines.

1 Introduction 



The perplexity measure is currently used in the speech recognition
and language modelling community for the following purposes:

1) To evaluate the quality of a language model. 

a) When comparing two language models, the one that has 
lower perplexity is chosen as the better one. 

b) When optimising some parameter of a language model (e.g. 
interpolation weight, discounting parameter, etc.), a param- 
eter value that minimises perplexity is chosen. 
2) To decide on the difficulty of a given recognition task.
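
As a rough illustration of purpose 1b), the sketch below computes the
perplexity of a held-out text and picks the interpolation weight that
minimises it. The per-word probabilities, the unigram back-off and the
weight grid are invented for illustration and do not come from the
experiments in this paper.

```python
import math


def perplexity(probs):
    """Perplexity of a text, given the model probability of each word."""
    log_sum = sum(math.log(p) for p in probs)
    return math.exp(-log_sum / len(probs))


def interpolated(p_bigram, p_unigram, lam):
    """Linear interpolation of a bigram and a unigram estimate."""
    return lam * p_bigram + (1.0 - lam) * p_unigram


def best_weight(pairs, grid):
    """Return the weight on the grid that gives the lowest perplexity."""
    return min(
        grid,
        key=lambda lam: perplexity([interpolated(b, u, lam) for b, u in pairs]),
    )


# Hypothetical (bigram, unigram) probabilities for four held-out words.
heldout = [(0.30, 0.05), (0.00, 0.10), (0.50, 0.02), (0.10, 0.08)]
grid = [i / 10 for i in range(1, 10)]  # weights 0.1 .. 0.9
lam = best_weight(heldout, grid)
print(lam, perplexity([interpolated(b, u, lam) for b, u in heldout]))
```

The weight is chosen purely on perplexity here, which is exactly the
practice whose link to recognition accuracy is questioned below.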

The main problems with using the perplexity measure for each of
these purposes are:

1) Perplexity is not always well correlated with recognition
accuracy. In other words, it can happen that language model LM1 has
lower perplexity than model LM2, but also has lower recognition
accuracy. In the case of 1a) above, this means that a suboptimal
language model gets used, thus leading to lower recognition
performance. In the case of 1b) above, this means that the language
model parameters can get tuned according to a suboptimal
criterion.

2) Two tasks, which have similar perplexities, can nevertheless have 
very different recognition accuracies. 

The first of these problems in particular is of considerable
importance when working on language modelling. There are many
modifications to language models that lead to a considerable
improvement in perplexity, but to a very small or insignificant
increase in recognition accuracy. This implies that full recognition
experiments need to be performed before the value of a proposed
modification can be properly determined. This can be very
time-consuming, thereby slowing down progress in the language
modelling field.

The above-mentioned problems with the perplexity measure have
been known for some time. Yet perplexity is still widely used to
evaluate the performance of language models, mainly because no
better, widely accepted alternative measure exists. In this paper, a
likely cause of the problems of the perplexity measure is explored
and directions for alternative measures are suggested.

2 Analysis 

One possible explanation for the above problems can be described
intuitively as follows. The crucial property of a good language model
is not that the probability of the correct word is very high in absolute
terms, but that it is higher than the probability of acoustically
confusable words.

This can be further illustrated with the following example. Suppose
the task at hand has a vocabulary of four words
{are, bar, cookie, dinner}. LM1 gives those words the probabilities
0.4, 0.3, 0.2 and 0.1, whereas LM2 assigns 0.1, 0.2, 0.3 and 0.4. Now
suppose that, based on acoustic information alone, instances of 'bar'
match the acoustic models for 'bar' and 'are' quite well. Then LM1
would be more likely to misrecognise 'bar' as 'are', even though the
absolute probability value it assigns to 'bar' (0.3) is higher than
that of LM2 (0.2).
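
The example can be made concrete with a small sketch. The language
model probabilities are those given above; the acoustic likelihoods
for an utterance of 'bar' are invented for illustration, as is the
assumption that the recogniser simply picks the word with the largest
product of language model and acoustic probability.

```python
# Language model probabilities from the four-word example.
LM1 = {"are": 0.4, "bar": 0.3, "cookie": 0.2, "dinner": 0.1}
LM2 = {"are": 0.1, "bar": 0.2, "cookie": 0.3, "dinner": 0.4}

# Hypothetical acoustic likelihoods p(A | w) for an utterance of 'bar':
# 'bar' and 'are' both match well, the other two words do not.
ACOUSTIC_BAR = {"are": 0.45, "bar": 0.50, "cookie": 0.03, "dinner": 0.02}


def recognise(lm, acoustic):
    """Pick the word with the largest product p(w) * p(A | w)."""
    return max(lm, key=lambda w: lm[w] * acoustic[w])


print(recognise(LM1, ACOUSTIC_BAR))  # 'are': 0.4 * 0.45 beats 0.3 * 0.50
print(recognise(LM2, ACOUSTIC_BAR))  # 'bar': 0.2 * 0.50 beats 0.1 * 0.45
```

With these assumed acoustic scores, LM1 produces the error and LM2
does not, although LM1 gives 'bar' the higher absolute probability.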

This problem is related to a fundamental equation of speech
recognition used to find the best hypothesis. Given the acoustic data
$A$, a recogniser chooses the word sequence $\hat{W}$ that is most
likely to correspond to $A$, i.e. it chooses

$$\hat{W} = \arg\max_{W} \; p(W) \cdot p(A \mid W). \tag{1}$$

Given the distributions $p(W)$ and $p(A \mid W)$, this is the best
choice and leads to minimal error rate. However, this equation does
not specify how the distributions should be estimated. Currently
$p(W)$ and $p(A \mid W)$ are estimated independently of each other.
Ideally, however, one would estimate $p(W)$ and $p(A \mid W)$ together
in order to minimise error rate. In theory, one could take account of
$p(A \mid W)$ when estimating $p(W)$ or vice versa. In this paper,
however, only the former will be explored.

Let $W^c$ denote the correct word sequence. The recogniser will
choose the correct word sequence if

$$p(W^c) \cdot p(A \mid W^c) > p(W) \cdot p(A \mid W) \quad \forall\, W \neq W^c. \tag{2}$$

Currently, when the language model probabilities $p(W)$ are estimated,
one tries to maximise the likelihood of the training text
$p(W_{train})$. One can argue that if $p(W)$ is large for correct word
sequences (like the ones in the training text), the above equation is
more likely to hold and one is therefore less likely to make errors.

Following the intuitive explanation of the problems with the
perplexity measure, however, one should try to make sure that the
likelihood of the correct word sequence is higher than that of its
acoustically confusable alternatives. Ideally, one would therefore
like to optimise something like

$$\frac{p(W^c) \cdot p(A \mid W^c)}{\sum_{W} p(W) \cdot p(A \mid W)}. \tag{3}$$

If $p(A \mid W)$ is assumed to be negligible for acoustically
dissimilar $W$, this can be approximated by restricting the sum in the
equation above to the acoustically confusable $W$.
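
As a minimal sketch of this normalised criterion, the four-word
example from above can be reused with the sum restricted to the two
acoustically confusable words. The acoustic likelihoods and the choice
of confusable set are assumptions made for illustration only.

```python
# Language model probabilities from the four-word example.
LM1 = {"are": 0.4, "bar": 0.3, "cookie": 0.2, "dinner": 0.1}
LM2 = {"are": 0.1, "bar": 0.2, "cookie": 0.3, "dinner": 0.4}

# Hypothetical acoustic likelihoods p(A | w) for an utterance of 'bar'.
ACOUSTIC_BAR = {"are": 0.45, "bar": 0.50, "cookie": 0.03, "dinner": 0.02}

# Assumed set of acoustically confusable words (includes the correct one).
CONFUSABLE = ["are", "bar"]


def normalised_ratio(lm, acoustic, correct, confusable):
    """Score of the correct word divided by the summed scores of the
    confusable words, each score being p(w) * p(A | w)."""
    numerator = lm[correct] * acoustic[correct]
    denominator = sum(lm[w] * acoustic[w] for w in confusable)
    return numerator / denominator


r1 = normalised_ratio(LM1, ACOUSTIC_BAR, "bar", CONFUSABLE)
r2 = normalised_ratio(LM2, ACOUSTIC_BAR, "bar", CONFUSABLE)
print(r1, r2)  # LM2 scores higher, although it gives 'bar' less probability
```

Under these assumptions the criterion prefers LM2, in line with the
intuition that the margin over confusable alternatives matters more
than the absolute probability of the correct word.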



3 Experiments and Results 

In order to explore the ideas mentioned above, several experiments
were performed on an Airborne Reconnaissance Mission (ARM) task,
which has a vocabulary of about 500 words. These experiments are
aimed at the problems 1a) and 1b) mentioned above and are only
intended as a proof of concept. In order to minimise the amount of
coding necessary, only bigram experiments were performed. Furthermore,
rather than looking at confusable word sequences in general, only
one-to-one substitutions were considered. Thus, equation 3 was
further approximated as follows.

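Let $W^c$ and $W$ be two word sequences which differ only in word
$j$, i.e. $w^c_i = w_i$ except for $i = j$. Under a bigram model, all
factors that do not involve position $j$ cancel, so the ratio of
their probabilities is

$$\frac{p(W^c) \cdot p(A \mid W^c)}{p(W) \cdot p(A \mid W)}
  = \frac{\prod_i p(w^c_i \mid w^c_{i-1}) \cdot p(A \mid W^c)}
         {\prod_i p(w_i \mid w_{i-1}) \cdot p(A \mid W)}
  = \frac{p(w^c_j \mid w^c_{j-1}) \cdot p(w^c_{j+1} \mid w^c_j) \cdot p(A \mid W^c)}
         {p(w_j \mid w^c_{j-1}) \cdot p(w^c_{j+1} \mid w_j) \cdot p(A \mid W)}.$$

If we further approximate the ratio of the acoustic probabilities
of the two word sequences by a similarity measure
$\mathrm{Sim}(w^c_j, w_j)$, one obtains

$$\frac{p(W^c) \cdot p(A \mid W^c)}{p(W) \cdot p(A \mid W)}
  \approx \frac{p(w^c_j \mid w^c_{j-1}) \cdot p(w^c_{j+1} \mid w^c_j)}
               {p(w_j \mid w^c_{j-1}) \cdot p(w^c_{j+1} \mid w_j)}
          \cdot \mathrm{Sim}(w^c_j, w_j).$$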
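
The single-substitution ratio can be sketched as follows. The bigram
probabilities, the similarity value and the handling of sentence
boundaries (a missing neighbour simply contributes no factor) are
assumptions made for illustration; in particular, `sim` here is only a
stand-in lookup table, not the HENR-based measure used in the
experiments.

```python
def substitution_ratio(bigram, sim, correct, j, alt):
    """Approximate probability ratio of the correct sequence over the
    sequence in which word j is replaced by `alt`.

    bigram[(prev, word)] holds p(word | prev); sim[(c, w)] stands in
    for the ratio of the acoustic probabilities.
    """
    wc = correct[j]
    ratio = sim[(wc, alt)]
    if j > 0:  # factor involving the left neighbour
        prev = correct[j - 1]
        ratio *= bigram[(prev, wc)] / bigram[(prev, alt)]
    if j + 1 < len(correct):  # factor involving the right neighbour
        nxt = correct[j + 1]
        ratio *= bigram[(wc, nxt)] / bigram[(alt, nxt)]
    return ratio


# Hypothetical numbers for the sentence "<s> raise the bar </s>".
sentence = ["<s>", "raise", "the", "bar", "</s>"]
bigram = {
    ("the", "bar"): 0.20, ("the", "are"): 0.05,
    ("bar", "</s>"): 0.30, ("are", "</s>"): 0.10,
}
sim = {("bar", "are"): 0.8}

ratio = substitution_ratio(bigram, sim, sentence, 3, "are")
print(ratio)  # about 9.6 = 0.8 * (0.20 / 0.05) * (0.30 / 0.10)
```

A ratio above 1 means the language model context favours the correct
word over this particular confusable alternative.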
Using this as the basic component, the average, normalised ratio
of probabilities of the correct word over its confusable alternatives is



Extending this to substitutions at any point in the correct word
sequence, the probability ratio measure, PRM, is calculated as



n( n , (8) 

By rearranging terms, this can be further rewritten as 



TT^ TT P{Wi\wf_^) * Sim{w1,Wi) -j-j- p{w1\w1_i) ^Y/\simUarwi.'i_\ 

w similarWi ^ ' t—Ly siinilarwi—i ^ ' ' 

(9) 



The similarity measure $\mathrm{Sim}(w^c_i, w_i)$ was calculated based
on the human equivalent noise ratio (HENR) measure as described in [Q].
The number of acoustically confusable words, NbSimil, was varied from 0
(in which case the normal perplexity measure is calculated) to 80.

In a first experiment, recognition experiments were run on 30 test
utterances. The standard perplexity measure and the probability
ratio measure (PRM) given by equation 8 were also calculated on
each of these files. The correlation between the recognition accuracy
and each of the measures was then calculated using Spearman's
rank-order correlation coefficient $r_s$ (from the Numerical Recipes
book [Q], pp. 507ff). It essentially calculates the linear correlation
coefficient of the set of points where the actual sample values are
replaced by their rank among all the samples. The results are shown in
Table 1, for different numbers of similar words considered.
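
The rank-based calculation described above can be sketched in a few
lines. The five perplexity and accuracy values are invented for
illustration and are not the thirty utterances used in the experiment;
giving tied values their average rank is a common convention assumed
here.

```python
import math


def average_ranks(values):
    """Ranks from 1 to n, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Linear correlation coefficient of two equally long samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman's r_s: the linear correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-utterance perplexities and accuracies (in %).
perplexities = [30, 25, 40, 20, 35]
accuracies = [85, 88, 80, 90, 86]
print(spearman(perplexities, accuracies))  # about -0.9
```

Because only ranks enter the calculation, the coefficient is unchanged
by any monotonic rescaling of either measure.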

Lower perplexities do in general lead to higher accuracy, which
explains the negative value of the correlation coefficient for
NbSimil = 0, i.e. the normal perplexity measure. On the other hand,
higher ratio measure values lead in general to higher accuracies,
which is why the corresponding correlation coefficients are positive
for all other entries, corresponding to various parameter settings of
the PRM measure. When looking at the absolute value of the correlation
coefficients, which is the actual value of importance, one can see
that the ratio measure (0.45 to 0.50) is more strongly correlated with
recognition accuracy than the perplexity measure (0.42).

NbSimil                  Correlation coefficient r_s
0 (i.e. perplexity)      -0.42
10                        0.47
20                        0.50
40                        0.46
80                        0.45

Table 1: Correlation coefficient r_s of recognition accuracies and the
probability ratio measure on thirty utterances for various numbers of
acoustically confusable words considered

In a second experiment, equation 9 was used as the optimisation
function of a clustering algorithm (instead of the usual one, which is
derived from the perplexity). This is done in a manner similar to
earlier work, where a modified optimisation criterion is derived for a
different purpose. The results are shown in Table 2, for different
numbers of similar words considered. Depending on the number of similar
words used, recognition results that are better than those using the
normal optimisation criterion can be obtained: 87.5% for NbSimil = 20
and 87.2% for NbSimil = 40, against 87.1% for the perplexity-derived
criterion, whereas NbSimil = 10 gives a lower 85.9%.

NbSimil                  Recognition accuracy (in %)
0 (i.e. perplexity)      87.1
10                       85.9
20                       87.5
40                       87.2

Table 2: Recognition accuracy for clustered language models built with
optimisation functions derived from the probability ratio measure for
various numbers of acoustically confusable words considered

4 Conclusions and Future Work

In this paper, a first attempt at deriving an improved performance
measure for language models, the probability ratio measure (PRM), is
described. In a proof-of-concept experiment, it is shown that PRM
correlates better with recognition accuracy and can lead to better
recognition results when used as the optimisation criterion of a
clustering algorithm. However, it is intended as a proof of concept
only and therefore has many limitations. The acoustic similarity
measure used, for example, is not derived from actual Hidden Markov
Models. Moreover, only pairwise confusions rather than complete
confusable paths with substitutions and insertions were considered. In
spite of these severe approximations, the results are very encouraging
and should justify more work along the same lines, possibly addressing
the limitations mentioned above.